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Abstract. In this paper we propose new solution methods for designing tag sets for use in universal 
DNA arrays. First, we give integer linear programming formulations for two previous formalizations of 
the tag set design problem, and show that these formulations can be solved to optimality for instance 
sizes of practical interest by using general purpose optimization packages. Second, we note the benefits of 
periodic tags, and establish an interesting connection between the tag design problem and the problem 
of packing the maximum number of vertex-disjoint directed cycles in a given graph. We show that 
combining a simple greedy cycle packing algorithm with a previously proposed alphabetic tree search 
strategy yields an increase of over 40% in the number of tags compared to previous methods. 


1 Introduction 

Recently developed universal DNA tag arrays [7,13,15] offer a flexible and cost-effective alternative 
to custom-designed DNA arrays for performing a wide range of genomic analyses. A universal tag 
array consists of a set of DNA strings called tags, designed such that each tag hybridizes strongly 
to its own antitag (Watson-Crick complement), but to no other antitag. A typical assay based 
on universal tag arrays performs Single Nucleotide Polymorphism (SNP) genotyping using the 
following steps [4,9]: (1) A set of reporter oligonucleotide probes is synthesized by ligating antitags 
to the 5' end of primers complementing the genomic sequence immediately preceding the SNP. (2) 
Reporter probes are hybridized in solution with the genomic DNA under study. (3) Hybridization of 
the primer part (3' end) of a reporter probe is detected by a single-base extension reaction using the 
polymerase enzyme and dideoxynucleotides fluorescently labeled with 4 different dyes. (4) Reporter 
probes are separated from the template DNA and hybridized to the universal array. (5) Finally, 
fluorescence levels are used to determine which primers have been extended and learn the identity 
of the extending dideoxynucleotides. 

Tag set design involves balancing two conflicting requirements: on one hand we would like a 
large number of tags to allow assaying a large number of biochemical reactions, on the other hand 
we would like the tags to work well for a wide range of assay types and experimental conditions. 

Ben Dor et al. [3] have previously formalized the problem by imposing constraints on antitag- 
to-tag hybridization specificity under a hybridization model based on the classical 2-4 rule, and 
have proposed near-optimal heuristics. In Section 3 we give an integer linear programming (ILP) 
formulation for this problem and its variant in which tags are required to have equal length [14]. 
Empirical results in Section 5 show that these ILP formulations have extremely small integrality 
gap, and can be solved to optimality for instance sizes of practical interest by using general purpose 
optimization packages. 

Previous works on tag set design [3,14] require for substrings that may form a nucleation 
complex and initiate cross hybridization not to be repeated within any selected tag. This constraint 
simplifies analysis, but is not required for ensuring correct tag functionality - what is required is for 
such substrings not to appear simultaneously in two different tags. To our knowledge, no previous 
work has assessed the impact that adding this constraint has on tag set size. In this paper we 
propose two algorithms for designing tag sets while relaxing this constraint. The first one is a 
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modification of the alphabetic tree search strategy in [13,14], The second algorithm stems from 
the observation that periodic tags, particularly those with a short period, use the least amount 
of “resources” and lead to larger tag sets, where the limited resources are in this case minimal 
substrings that can form nucleation complexes (for formal models see Section 2). In Section 4 we 
establish an interesting connection between the tag design problem and the problem of packing the 
maximum number of vertex-disjoint directed cycles in a given graph, and propose a simple greedy 
algorithm for the latter one. Results in Section 5 show that combining the greedy cycle packing 
algorithm with alphabetic tree search strategy yields an increase of over 40% in the number of tags 
compared to previous methods. 

The rest of the paper is organized as follows. In Section 2, we describe the hybridization model, 
give formal problem formulations for several variants of the tag set design problem, and briefly 
review previous work. In Section 3, we give integer program formulations for the problem variant in 
[3] and its variant in which tags are required to have equal length [14]. In Section 4, we describe the 
algorithms for problem variants in which potential nucleation complexes can be appear multiple 
times as substrings of a tag, and prove that packing the maximum number of vertex-disjoint directed 
cycles in a given graph is APX-hard. Finally, we give experimental results in Section 5 and conclude 
with some open problems in Section 6 . 


2 Problem Formulations and Previous Work 

A main objective of universal array designers is to maximize the number of tags, which directly 
determines the number of reactions that can multiplexed using a single array. At the same time, tag 
sets must satisfy a number of stability and non-interaction constraints [ 6 ]. The set of constraints 
depends on factors such as the array manufacturing technology and the intended application. In 
this section we formalize the most important stability and non-interaction constraints using the 
hybridization model in [3]. 

Hybridization model. Hybridization affinity between two oligonucleotides is commonly charac¬ 
terized using the melting temperature, defined as the temperature at which half of the duplexes 
are in hybridized state and the other half are in melted state. However, accurate melting temper¬ 
ature estimation is computationally expensive, e.g., estimating the melting temperature between 
two non-complement ary oligonucleotides using the near-neighbor model of SantaLucia [17] is an 
NP-hard problem [10]. Ben-Dor et al. [3,4] formalized a conservative hybridization model based 
on the observation that stable hybridization requires the formation of an initial nucleation com¬ 
plex between two perfectly complementary substrings of the two oligonucleotides. For nucleation 
complexes, hybridization affinity is modeled using the classical 2-4 rule [18], which estimates the 
melting temperature of the duplex formed by an oligonucleotide with its complement as the sum 
between the number of weak bases (i.e., A and T) and twice the number of strong bases (i.e., G and 
C). 

The weight w{x) of a DNA string x = aia 2 ---ak is defined as w{x) = J2i=iw{ai), where 
w{A) = w{T) = 1 and w{C) = w{G) = 2. Throughout this paper we assume the following c-token 
hybridization model [3]: hybridization between two oligonucleotides takes place only if one oligo 
contains as substring the complement of a substring of weight c or more of the other, where c is 
a given constant. The complement of a string x = 0102 •.. a^, over the DNA alphabet {A, C, T, G} is 
defined as x = 6162 • • • where hi is the Watson-Crick complement of Ok-i+i- 

Hybridization stability. Current industry designs require a predetermined tag length I, e.g., 
GenFlex tag arrays manufactured by Affymetrix use I = 20 [1]. The model proposed in [3] allows 
tags of unequal length and instead require a minimum tag weight of h, for a given constant h. 
In this paper we consider both types of stability constraints, and use the parameter a G {l,h} to 
denote the specific model used for hybridization stability. 



Pairwise non-interaction constraints. A basic constraint in this category is for every antitag 
not to hybridize to non-complementary tags [3]. For a DNA string x and a set of tags T, let Nq~{x) 
denote the number of tags in T that contain x as a substring. Using the c-token hybridization 
model, the antitag-to-tag hybridization constraint is formalized as follows: 

(C) For every feasible tag set T, Nq-{x) < 1 for every DNA string x of weight c or more. 

In many assays based on universal tag arrays it is also required to prevent antitag-to-antitag 
hybridization, since the formation of such antitag-to-antitag duplexes or antitag hair-pin structures 
prevents reporter probes from performing their function in the solution-based hybridization steps 
[6,14]. The combined constraints on antitag hybridization are formalized as follows 
(C) For every feasible tag set T, Nq~{x) + Nq~{x) < 1 for every DNA string x of weight c or more. 

In the following we use the parameter (3 G {C,C} to specify the type of pairwise hybridization 
constraints. 

Substring occurrences within a tag. Previous works on DNA tag set design [3,14] have imposed 
the following c-token uniqueness constraint in addition to constraints (C) and (C); a DNA string of 
weight c or more can appear as a substring of a feasible tag at most once. This uniqueness constraint 
has been added purely for ease of analysis (e.g., it is the key property enabling the DeBruijn 
sequence based heuristics in [3]), and is not required for ensuring correct assay functionality. To 
our knowledge, no previous work has assessed the impact that adding this constraint has on tag 
set size. In the following we will use the parameter 7 G {1, multiple} to specify whether or not the 
c-token uniqueness constraint is enforced. 

For every a G {I, h}, (3 G {C, C}, and 7 G {1, multiple}, the maximum tag set design problem with 
constraints a, (3,^, denoted MTSDP(q;|/3 |7), is the following; given constants c and l/h, find a tag 
set of maximum cardinality satisfying the constraints. 

Previous work on tag set design. Ben-Dor et al. [3] formalized the c-token model for oligonu¬ 
cleotide hybridization and studied the MTSDP(/i,|C'|l) problem. They established a constructive 
upperbound on the optimal number of tags for this formulation, and gave a nearly optimal tag 
selection algorithm based on DeBruijn sequences. Similar upper bounds are established for the 
MTSDP(/|C'|1) and MTSDP(>f:|C'|l) problems in [14], which also extends a simple alphabetic tree 
search strategy originally proposed in [13] to handle all considered problem variants. 

For a comprehensive survey of hybridization models, results on the associated formulations for 
the tag set design problem, and further motivating applications in the area of DNA computing, we 
direct the reader to [ 6 ]. 

3 Integer Linear Programming Formnlations for MTSDP(*|C|1) 

Before stating our integer linear program formulation, we introduce some additional notations. 

Following [3], a DNA string x of weight c or more is called a c-token if all its proper suffixes 
have weight strictly less than c. Clearly, it suffices to enforce constraint (C) for all c-tokens x. Let 
N denote the number of c-tokens, and C = {ci,..., cat} denote the set of all c-tokens. The results 
in [3] imply that N = 0((1 -|- Note that the weight of a c-token can be either c or c-|- 1, the 

latter case being possible only if the c-token starts with a strong base (G or C). We let Cq C C denote 
the set of c-tokens of weight c -|- 1 that end with a weak base, i.e., c-tokens of the form S<c — 2>W, 
where W (S) denote a weak (strong) base, and <c — 2> denotes an arbitrary string of weight c — 2. 
We also let C 2 C C denote the set of c-tokens of weight c that end with a strong base, i.e., c-tokens 
of the form <c — 2>S. 

Clearly, there is at most one c-token ending at every letter of a tag. It is easy to see that each 
c-token x G Cq contains a proper prefix which is itself a c-token, and therefore x cannot be the first 



c-token of a tag, i.e., cannot be the c-token with the leftmost ending. All other c-tokens can appear 
as first c-tokens. When a c-token in C \ (Cq U C 2 ) is the first in a tag, then it must be a prefix of 
the tag. On the other hand, tokens in C 2 can be the first both in tags that they prefix and in tags 
in which they are preceded by a weak base not covered by any c-token. 

The ILP formulation for MTSDP(f|C'|l) uses an auxiliary directed graph G = (y,E) with 
V = {s,f} U Ui<i<Ar^) where Vi = {v^ \ \ci\ < k < 1}. G has a directed arc from uf to 
for every triple i, j, A: such that |ci| < A: < f — 1 and Cj is obtained from Cj by appending a single 
nucleotide and removing the maximal prefix that still leaves a valid c-token. Finally, G has an arc 
from s to every v G Vfirst, where Vfirst = | Cj G C \ Cq} U \ Ci G C 2 }, and an arc from 

v\ to t for every 1 < i < A. 

We claim that, for c < I, MTSDP(/|C'|1) can be reformulated as the problem of finding the 
maximum number of s-t paths in G that collectively visit at most one vertex vf for every i. Indeed, 
let P be an s-t path and vf be the vertex following s in P. If A: = |ci|, we associate to P the tag 
obtained by concatenating c* with the last letters of the c-tokens corresponding to the subsequently 
visited vertices, until reaching t. Otherwise, if A: = |ci| -|- 1 (which implies that Cj G C 2 ) we associate 
to P the two tags obtained by concatenating either A or T with Ci and the last letters of subsequently 
visited c-tokens. The claim follows by observing that at most one of the tags associated with each 
path can be used in a feasible solution. 

Our ILP formulation can be viewed as a generalized version of the integer maximum flow prob¬ 
lem in which unit capacity constraints are imposed on sets of vertices of G instead of individual 
vertices. The formulation uses 0/1 variables and t/e for every every vertex v G P \ {s,f}, re¬ 
spectively arc e G E. These variables are set to 1 if the corresponding vertex or arc is visited by 
an s-t path corresponding to a selected tag. Let in{v) and out{v) denote the set of arcs entering, 
respectively leaving vertex v. The integer program can then be written as follows: 


maximize ^ Xy 


(1) 

subject to 

Xv= 2/e 

= 2/e, VGV\{s,t} 

(2) 

eGin(v) 

e£out(v) 


^ < I, 

l<i<N 

( 3 ) 

v£Vi 

Xy,ye G {0,1}, 

V G V \ {s,t},e G E 

( 4 ) 


Constraints (2) ensure that variables ye set to 1 correspond to a set of s-t paths, and that a variable 
Xy is set to 1 if and only if one of these paths passes through vf Antitag-to-tag hybridization 
constraints (C) and c-token uniqueness are enforced by (3). Finally, the objective (1) corresponds 
to maximizing the number of selected tags, since the shortest prefix of a tag that is a c-token must 
belong to C \ Cq. 

For a token c* = CjX G Cq, where X G {A,T}, let Cj = c^a. Since both Cj and cj contain Cj 
as a prefix, and Cj can appear at most once in a feasible tag set T, it follows that at most one 
of them can appear in T. Therefore, the following valid inequality can be added to the the ILP 
formulation (l)-(4) to improve its integrality gap (i.e., the gap between the value of the optimum 
integer solution and that of the optimal fractional relaxation); 

^ < 1, Ci G Cq, Cj = Ci, i < j (5) 

veViUVj 

^ Variables Xv can be eliminated by replacing them with the corresponding sums of Xe’s; we use them here merely 
for improving readability. ILP sizes reported in Section 5 refer to the equivalent reduced formulations obtained by 
eliminating these variables. 



The formulation of MTSDP(/|C'|1) has exactly the same objective and constraints for a slightly 
different graph G. Let us define the tail weight of a c-token C, denoted tail{C), as the weight of 
C’s last letter. Also, let hi = h if Ci has a tail weight of 1 and hi = h + 1 Ci has a tail weight 
of 2. We will require that every tag ending with token Cj has total weight of at most hi] it is easy 
to see that this constraint is not affecting the size of the optimum tag set. We now define the 
graph G = {V,E) with V = {s,t} U Ui<i<Af where Vi = {uf | w{ci) < k < hi}. G contains a 

directed arc from vf to fQj- every triple i, j, A: such that \ci\ < k < hi — tail{ci) and cj is 

obtained from a by appending a single nucleotide and removing the maximal prefix that still leaves 
a valid c-token. Finally, G contains arcs from s to every v G Vfirst, where Vfirst is now equal to 

I Ci G C \ Co} U I Ci G C 2 }, plus arcs from every vf to t for every 1 < f < and 

hi — tail{ci) < k < hi- 

4 Algorithms for MTSDP(*| * \multiple) 

In the following we describe two algorithms for MTSDP(/|C'|mu/tzp/e); both algorithms can be easily 
adjusted to handle the other MTSDP(>f=| * \multiple) variants. The first algorithm (see Figure 1 for a 
detailed pseudocode) is similar to the alphabetic tree search algorithms proposed for MTSDP(/|C'|1) 
in [14]. The algorithm performs an alphabetical traversal of a 4-ary tree representing all 4^ possible 
tags, skipping over subtrees rooted at internal vertices that correspond to tag prefixes including 
unavailable c-tokens. The difference from the MTSDP(^|C'|1) algorithm in [14] lies in the strategy 
used to mark c-tokens as unavailable. While the algorithm in [14] marks a c-token C as unavailable 
as soon as it incorporates it in the current tag prefix (changing C’s status back to “available” when 
forced to backtrack past C’s tail), the algorithm in Figure 1 marks a c-token as unavailable only 
when a complete tag is found. 

Note that the alphabetic tree search algorithm produces a maximal feasible set of tags T, i.e., 
there is no tag t such that T U {t} remains feasible for MTSDF{l\G\multiple). Hence, every tag of 
an optimal solution must share at least one c-token with tags in T. Since every tag of T has at 
most I — c/2 -|- 1 c-tokens, it follows that the alphabetic tree algorithm (and indeed, every algorithm 
that produces a maximal feasible set of tags) has an approximation factor of I — c/2 -|- 1. 

We call a tag t periodic if t is the length I prefix of an infinite string x°°, where x is a DNA 
string with |x| < \t\. (Note that a periodic tag t is not necessarily the concatenation of an integer 
number of copies of its period x as in the standard definition of string periodicity [12].) 

The following lemma shows that tag set design algorithms can restrict the search to two simple 
classes of tags. 

Lemma 1. For every c and I, there exists an optimal tag set T in whieh every tag has the unique¬ 
ness property or is periodie."^ 

Proof. Let T be an optimal tag set. Assume that T contains a tag t that does not have the 
uniqueness property, and let cq,..., cq be the sequence of c-tokens occurring in f, in left to right 
order. Since t does not have the uniqueness property, there exist indices I < j < j' < ik such that 
Cij = Ci.,. Let t' be the tag formed by taking the first I letters of the infinite string with c-token 
sequence {ci-,... ,Ci )°°; note that is a periodic tag. Since c-tokens Ci.,..., Ci do not appear 
in the tags of T \ {f}, it follows that (T \ {f}) U {f'j is also optimal. Repeated application of this 
operation yields the lemma. □ 

Note that a periodic tag whose shortest period has length p contains as substrings exactly p 
c-tokens, while tags with the uniqueness property contain between f — c -|- 1 and I — c/2 -|- 1 c- 
tokens. Therefore, of the two classes of tags in Lemma 1, periodic tags (particularly those with 
short periods) make better use of the limited number of available c-tokens. 

Note that the two classes of tags are not disjoint, since there exist periodic tags that have the uniqueness property. 


2 



Input: Positive integers c and I, c< I 

Output: Feasible MTSDP{l\C\multiple) solution T 


Mark all c-tokens as available 
For every i £ {1,2,... ,1}, Bi <— A 
T ^ 0; Finished <— 0; pos ^ 1 
While Finished = 0 do 

While the weight of B 1 B 2 ... Bpos < c do 
pos ^ pos + 1 
EndWhile 

If the c-token ending B 1 B 2 ... Bpos is available then 
If pos = I then 

T ^Tu{BiB2...Bi} 

Mark all the c-tokens of Bg ... Bi as unavailable 

pos ^ [the position where the hrst c-token of Bi B 2 ... Bi ends] 

7 ^ I 1 < i < pos, Bi 7 ^ G} 

If 7 = 0 then 
Finished ^ 1 
Else 

pos ^ niax{7} 

Bi ■<— A for all i £ {pos -|- 1 ,..., 1} 

Bpos ^ nextbase{Bpos) 

Endlf 

Else 

pos <— pos -I- 1 

Endlf 

Else 

7 ^ {i I 1 < i < pos, Bi / G} 

If 7 = 0 then 
Finished ^ 1 
Else 

pos <— niax{7} 

Bi ^ A for all i £ {pos -I- 1,..., Z} 

Bpos ^ nextbase{Bpos) 

Endlf 

Endlf 

EndWhile 


Fig. 1. The alphabetic tree search algorithm for MTSDP{l\C\multiple). The nextbase(-) function is dehned by 
nextbase{A) = T, nextbase{l) = C, and nextbase{C) — G. 


Each periodic tag corresponds to a directed cycle in the graph which has C as its vertex set, 
and in which a token q is connected by an arc to token Cj iff Ci and cj can appear consecutively in a 
tag, i.e., iff Cj is obtained from Cj by appending a single nucleotide and removing the maximal prefix 
that still leaves a valid c-token. Clearly, a vertex-disjoint packing of n cycles in He yields a feasible 
solution for M.TSDF{l\C\multiple) consisting of n tags, since we can extract at least one tag of 
length I from each cycle, and tags extracted from different cycles do not have common c-tokens. 
This motivates the following: 

Maximum Vertex-Disjoint Directed Cycle Packing Problem: Given a directed graph G, 
find a maximum number of vertex-disjoint directed cycles in G. 

The next theorem shows that Maximum Vertex-Disjoint Directed Cycle Packing in 
arbitrary graphs is unlikely to admit a polynomial approximation scheme. A stronger inapproxima- 
bility results was recently established for arbitrary graphs by Salavatipour and Verstraete [16], 
who proved that there is no 0(log^“^ n)-approximation for Maximum Vertex-Disjoint Di¬ 
rected Cycle Packing unless NP C DTIME{2'P°^y^°3'^). On the positive side, Salavatipour 
and Verstraete showed that Maximum Vertex-Disjoint Directed Cycle Packing can be 





(a) 


'OO' 

(b) 


Fig. 2. Vertices and arcs added to G(0) for (a) variable Xi, and (b) clause h V Ij. 

approximated within a factor of 0{y/n) via linear programming techniques, matching the best 
approximation factor known for the edge-disjoint version of the problem [11]. 

Theorem 1. Maximum Vertex-Disjoint Directed Cycle Packing is APX-hard even for 
regular direeted graphs with in-degree and out-degree of 2. 


Proof. We use a reduction from the MAX-2-SAT-3 problem, similar to the one in [8]. An instance f 
of MAX-2-SAT-3 consists of a set {ci,..., Cm} of disjunctive clauses over a ground set {xi,..., Xn} 
of variables. Each clause consists of at most 2 literals (variables or negations of variable), and each 
variable appears in at most 3 clauses, counting both negated and non-negated occurrences. The 
objective is to find a truth assignment that satisfies as many of the clauses as possible. It is known 
that MAX-2-SAT-3 is APX-hard [2,5]. 

Let nii denote the number of occurrences of variable Xi in a given instance of MAX-2-SAT-3. 
We construct, in polynomial time, a directed graph G{(l)) as follows. For each variable Xi we add 
to G a directed cycle Ci of length plus 2mj additional vertices alternatively labeled by Xi and 
Xi, used to close a directed cycle of length 3 with each arc of Ci, as in Figure 2(a). For each unary 
clause we pick a distinct vertex labeled by the negation of the respective literal and attach a loop 
to it. Finally, for each 2-literal clause c we pick 2 vertices labeled by the negations of the literals 
of c, again without reusing labeled vertices between clauses, and use a new vertex to connect them 
via two length-2 cycles as in Figure 2(b). Note that, for every i, at least 2X]r=i of labeled 
vertices remain incident to a single cycle; we will refer to these as “free” labeled vertices. 

We claim that every truth assignment that makes k clauses of f true can be converted in 
polynomial time into a set oi k 2X]iLi vertex disjoint cycles of C{(t)), and vice-versa. Indeed, 
for a given truth assignment, select (1) the 2mi length-3 cycles passing through nodes labeled by Xi 
for every variable Xi that is set to true, (2) the 2mj length-3 cycles passing through nodes labeled 
by Xi for every variable Xi that is set to false, and (3) the loop or length-2 cycle passing through a 
labeled node corresponding to a false literal. It is easy to verify that these cycles are vertex-disjoint. 

Conversely, let C be a set of /c -|- 2 vertex disjoint cycles of C{4>). If any of the cycles Ci 

is in C, we replace it by the length-3 cycle passing through a free labeled vertex. Similarly, if any 
of the cycles in C visits two of the arcs of a 3-cycle (or one of the arcs of a 2-cycle), we replace it 
by the 3-cycle (respectively 2-cycle) itself. After this transformation we have a set of A: + 2 
vertex-disjoint loops, 2-cycles, and 3-cycles. We say that a set of cycles is eonsistent if only one of 
the labels Xi, Xi appear in C for every i. If C is consistent, we choose a truth assignment that makes 
all literals corresponding to labels in C true. It is easy to see that at least k of the cycles in C must 
be loops and 2-cycles, and clauses corresponding to these cycles are satisfied by the above truth 
assignment. 

Otherwise, we make C consistent by repeating the following transformation. Let i be an index 
for which both Xi and Xi appear in C. Without loss of generality, assume that Xi appears in only 
one clause of f) (recall that, together, Xi and Xi can appear in at most 3 clauses). It follows that 
there is a single loop or 2-cycle C £ C visiting a vertex labeled by Xi - all other vertices labeled by 




Table 1. ILP results for MTSDP(1|C'|1), i.e., tag set design with specified tag length I, antitag-to-tag hybridization 
constraints, and a unique copy of each c-token allowed in a tag. 


1 

C 

#tags 

Upper Bounds 

LP/ILP statistics 



W 

ILP 

LP 

-iTir 

#constr 

/invars 

/(^non-zero 

LP time 

ILP time 

10 

T 

7 

8 

8.57 

9 

406 

1878 

6004 

0.13 

0.71 

10 

5 

23 

28 

28.00 

29 

1008 

4600 

14596 

2.27 

5.85 

10 

6 

67 

85 

85.60 

96 

2434 

10940 

34470 

11.40 

98.25 

10 

7 

196 

259 

259.67 

328 

5808 

25422 

79274 

86.70 

586.67 

10 

8 

655 

- 

853.33 

1194 

13554 

57138 

175492 

552.74 

- 

20 

T 

3 

3 

3.53 

3 

926 

4638 

15244 

1.05 

58.46 

20 

5 

9 

10 

10.50 

11 

2448 

12240 

40076 

13.72 

381.33 

20 

6 

26 

29 

29.87 

32 

6354 

31860 

104270 

182.96 

12448.61 

20 

7 

75 

- 

88.00 

93 

16528 

82662 

270194 

2675.68 

- 

20 

8 

213 

- 

257.23 

275 

42834 

213578 

697292 

134525.81 

- 


Table 2. ILP results for MTSDP(h|C'il), i.e., tag set design with specified minimum tag weight h, antitag-to-tag 
hybridization constraints, and a unique copy of each c-token allowed in a tag. 


h 

c 

#tags 

Upper Bounds 

LP/ILP statistics 



W 

ILP 

LP 


^constr 

/invars 

//^non-zero 

LP time 

ILP time 

15 

T 

6 

7 

7.00 

7 

610 

2966 

9612 

0.45 

9.04 

15 

5 

18 

21 

21.09 

21 

1550 

7456 

23998 

5.66 

117.62 

15 

6 

47 

63 

63.20 

63 

3830 

18322 

58752 

54.43 

2665.39 

15 

7 

149 

192 

192.00 

192 

9406 

44416 

141638 

544.95 

3644.85 

15 

8 

460 

- 

588.00 

590 

22766 

105746 

334904 

7153.87 

- 

28 

T 

3 

3 

3.30 

3 

1286 

6554 

21624 

1.88 

132.78 

28 

5 

8 

9 

9.67 

9 

3422 

17388 

57122 

34.66 

1137.21 

28 

6 

22 

27 

27.48 

27 

8926 

45518 

149492 

392.42 

18987.09 

28 

7 

64 

- 

78.55 

78 

23342 

118828 

389834 

7711.41 

- 

28 

8 

175 

- 

- 

224 

60830 

309118 

1013244 

- 

- 


Xi are free. Since the XiS and XiS alternate around Cj, the cycles going through vertices labeled by 
Xi can be replaced by at least the same number of 3-cycles going through vertices labeled by Xj. 

To complete the proof of the theorem, notice that the optimum number of satisfiable clauses, 
kopt, is at least m/2, since we can repeatedly assign a variable such that at least half of the 
clauses containing it are satisfied. Hence, ^ 2m < 4/copf. If there exists a polynomial time 

algorithm with an approximation factor of for Maximum Vertex-Disjoint Directed Cycle 
Packing, we can run it on G(i/)) to get a set C of at least k + ^ 

vertex disjoint cycles, and then convert C as above into a truth assignment satisfying k > ^Y^^opt 
clauses of (/>. □ 

We use a simple greedy algorithm to solve Maximum Vertex-Disjoint Directed Cycle 
Packing for the graph He- we enumerate possible tag periods in pseudo-lexicographic order, and 
check for each period if all c-tokens are available for the resulting tag. We refer to this algorithm 
as the greedy cycle packing algorithm^ since it is equivalent to packing cycles greedily in order of 
length. 

5 Experimental results 

Tables 1 and 2 give ILP statistics (number of constraints, number of variables, and number of non¬ 
zero coefficients), LP and ILP runtime, and LP and ILP solution values for MTSDP(/|C'|1) and 
MTSDP(/i|C'|l). We also include the upper bounds established in [14] and [3] for these problems, 
and the number of tags found by using the alphabetic tree search algorithm in [14]. We solved 




Table 3. Results for MTSDP(*|C|mMZtiple), i.e., tag set design with antitag-to-tag hybridization constraints and 
mnltiple copies of a c-token allowed in a tag. 


l/h 

c 

One c-token copy 

Mnltiple c-token copies 

Algorithm in [14J 

Tree search 

Cycle packing -|- Tree search 

tags 

c-tokens 

tags 

c-tokens 

tags 

c-tokens 

% cyclic 


4 

3 

51 

14 

59 

17 

40 

100.0 


5 

9 

146 

31 

165 

40 

140 

100.0 


6 

26 

404 

53 

433 

72 

293 

98.6 

1 = 20 

7 

75 

1100 

124 

1179 

178 

928 

99.4 


8 

213 

2976 

281 

3095 

383 

2411 

97.1 


9 

600 

7931 

711 

8230 

961 

7102 

96.9 


10 

1667 

20771 

1835 

21400 

2344 

19691 

95.1 


4 

3 

58 

14 

61 

17 

40 

100.0 


5 

8 

150 

32 

174 

40 

140 

100.0 


6 

22 

398 

44 

432 

72 

300 

98.6 

h> 28 

7 

64 

1119 

118 

1200 

178 

934 

99.4 


8 

175 

2918 

239 

3037 

379 

2405 

96.6 


9 

531 

8431 

632 

8622 

943 

6969 

96.5 


10 

1428 

21707 

1570 

22145 

2260 

19270 

94.1 


Table 4. Results for MTSDP{*\C\multiple), i.e., tag set design with both antitag-to-tag and antitag-to-antitag 
hybridization constraints and multiple copies of a c-token allowed in a tag. 


l/h 

c 

One c-token copy 

Mnltiple c-token copies 

Algorithm in [14J 

Tree search 

Cycle packing -|- Tree search 

tags 

c-tokens 

tags 

c-tokens 

tags 

c-tokens 

% cyclic 


4 

1 

17 

10 

35 

10 

25 

100.0 


5 

4 

65 

17 

83 

23 

85 

100.0 


6 

13 

200 

30 

241 

41 

171 

97.6 

1 = 20 

7 

37 

537 

68 

585 

97 

512 

99.0 


8 

107 

1480 

147 

1619 

202 

1268 

98.0 


9 

300 

3939 

362 

4124 

512 

3799 

96.3 


10 

844 

10411 

934 

10869 

1204 

10089 

95.8 


4 

1 

22 

10 

36 

10 

25 

100.0 


5 

4 

74 

17 

84 

23 

85 

100.0 


6 

12 

213 

29 

238 

41 

178 

97.6 

h> 28 

7 

32 

559 

64 

586 

97 

518 

99.0 


8 

90 

1489 

135 

1632 

199 

1238 

98.0 


9 

263 

4158 

329 

4314 

504 

3760 

95.8 


10 

714 

10837 

809 

11250 

1163 

9937 

93.6 


all integer programs and their fractional relaxations using the CPLEX 9.0 commercial solver with 
default parameters run using a single CPU on a dual 2.8 GHz Dell PowerEdge 2600 Linux server. 
Missing entries did not complete in 10 hours. 

The ILP solutions can be found in practical time for small values of c, which are appropriate 
for universal tag array applications, such as the emerging microfluidics-based labs-on-a-chip, where 
moderate multiplexing rates are sufficient and ensuring high hybridization stringency is costly. For 
all cases where the optimum could be computed, the difference between the optimal fractional and 
integer solution values was smaller than 1, indicating why CPLEX can solve to optimality these ILPs 
despite their size. Furthermore, ILP results confirm the extremely high quality of the upperbound 
established for MTSDP(/i|C'|l) in [3]; the upperbound established in [14] for MTSDP(/|C'|1) appears 
to be somehow weaker. 

Tables 3 and 4 give the results obtained for MTSDP(>i:| * \multiple) by the alphabetic tree search 
algorithm in Figure 1 respectively by the greedy cycle packing algorithm (in our implementation, we 
impose an upper bound of 15 on the length of the cycles that we try to pack) followed by running the 
alphabetic tree search algorithm with the c-tokens occurring in the selected cycles already marked 





as unavailable. Performing cycle packing significantly improves the results compared to running the 
alphabetic tree search algorithm alone; as shown in the tables, most of the resulting tags are found 
in the cycle packing phase of the combined algorithm. 

Across all instances, the combined algorithm increases the number of tags by at least 40% 
compared to the MTSDP(>i=| * |1) algorithm in [14]; the improvement is much higher for smaller 
values of c. Quite notably, although the number of tags is increased, the tag sets found by the 
combined algorithm use a smaller total number of c-tokens. Thus, these tag sets are less likely to 
cross-hybridize to the primers used in the reporter probes, enabling higher tag utilization rates 
during tag assignment [4,14]. 

6 Conclusions 

In this paper we proposed new solution methods for designing tag sets for universal DNA arrays. 
We have shown that optimal solutions can be found in practical time for moderate problem sizes 
by using integer linear programming, and that the use of periodic tags leads to increases of over 
40% in the number of tags, with simultaneous increases in effective tag utilization rates during tag 
assignment. Our algorithms use simple greedy strategies, and can be easily modified to incorpo¬ 
rate additional practical design constraints, such as preventing the formation of hairpin secondary 
structures, or disallowing specific nucleotide sequences such as runs of 4 identical nucleotides [13]. 

An interesting open problem is to find tight upper bounds and exact methods for the MTSDP(*[>(= 
\multiple) formulations. Settling the approximation complexity of Maximum Vertex-Disjoint 
Directed Cycle Packing is another interesting problem. 
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